Microbiome data integration workflow for population cohort studies
# Get starting time
starting_time <- Sys.time()
################################################################################
# List of packages that we need
packages <- c("ggplot2", "mia", "miaViz")
# Get packages that are already installed (match against package names only)
packages_already_installed <- packages[ packages %in% rownames(installed.packages()) ]
# Get packages that need to be installed
packages_need_to_install <- setdiff( packages, packages_already_installed )
# Load BiocManager into the session. Install it if it is not already installed.
if( !require("BiocManager") ){
    install.packages("BiocManager")
    library("BiocManager")
}
# If there are packages that need to be installed, install them with BiocManager.
# This also updates old packages without prompting.
if( length(packages_need_to_install) > 0 ) {
    BiocManager::install(packages_need_to_install, ask = FALSE)
}
# Load all packages into the session. Stop if any package was not
# successfully loaded.
if( any(!sapply(packages, require, character.only = TRUE)) ){
    stop("Error in loading packages into the session.")
}
################################################################################
# Additional setup
# Set black and white theme for figures, and Arial font
theme <- theme_bw() + theme(text = element_text(family = "Arial"),
                            panel.border = element_blank(),
                            panel.grid.major = element_blank(),
                            panel.grid.minor = element_blank(),
                            axis.line = element_line(colour = "black"))
theme_set(theme)
All authors are affiliated with the Turku Data Science Group at the
University of Turku, Finland.
# Plot publication graph
path <- "data/PubMed_Timeline_Results_by_Year.csv"
df <- read.csv(path, skip = 1)
x <- "Year"
y <- "Count"
plot <- ggplot(df, aes(x = .data[[x]], y = .data[[y]])) +
    geom_bar(stat = "identity")
plot
PubMed publications per year for the search term ‘microbiome’ (fetched: Sep 5, 2023)
For cohorts with multiple data types or sources, MultiAssayExperiment is considered a very reliable choice of data structure: once constructed correctly, it eases the management and wrangling of the data. Moreover, a growing number of R packages and frameworks integrate the MultiAssayExperiment and SummarizedExperiment classes to provide users with a reliable and easy-to-use data structure.
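To make the data structure described above more concrete, the following is a minimal sketch of how a MultiAssayExperiment can be built from two SummarizedExperiment objects that share samples. All data here are simulated, and the experiment names (`microbiota`, `metabolites`), sample IDs and the `group` variable are hypothetical; the sketch is not taken from the cohort analysed in this workflow.

```r
library(SummarizedExperiment)
library(MultiAssayExperiment)

set.seed(1)
samples <- paste0("S", 1:4)

# Simulated taxonomic counts for all four samples (hypothetical data)
taxa_counts <- matrix(rpois(12, lambda = 10), nrow = 3,
                      dimnames = list(paste0("taxon", 1:3), samples))

# Simulated metabolite measurements, available for three samples only
metab <- matrix(rnorm(6), nrow = 2,
                dimnames = list(c("m1", "m2"), samples[1:3]))

# Wrap each data type in its own SummarizedExperiment
se_taxa  <- SummarizedExperiment(assays = list(counts = taxa_counts))
se_metab <- SummarizedExperiment(assays = list(abundance = metab))

# Sample-level metadata; row names must match the experiments' column names
sample_data <- DataFrame(group = c("case", "control", "case", "control"),
                         row.names = samples)

# With matching names, the sample map is generated automatically
mae <- MultiAssayExperiment(
    experiments = ExperimentList(microbiota = se_taxa, metabolites = se_metab),
    colData = sample_data
)

mae
```

Each experiment keeps its own features and samples, while `colData(mae)` holds the shared sample metadata and `sampleMap(mae)` records which columns of which experiment belong to which sample.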
We get the data from the MGnify database, EMBL-EBI’s database for metagenomic data. This large microbiome database can be accessed with the MGnifyR package, which now supports TreeSE (TreeSummarizedExperiment). The package will be submitted to the next Bioconductor release.
We chose this dataset…
As loading takes some time, the dataset is already loaded.
For other available datasets and importing methods, check OMA.
# library(MGnifyR)
# mg <- MgnifyClient()
#
# analyses <- searchAnalysis(mg, "studies", "MGYS00005128")
# analyses <- searchAnalysis(mg, "studies", "MGYS00000596")
# mae <- getResult(mg, analyses)
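Because fetching takes some time, one common way to work with a pre-loaded dataset is to cache the fetched object on disk and reuse it on later runs. The sketch below illustrates that pattern with base R only; the helper `load_or_fetch()`, the cache path and the stand-in `slow_fetch()` function are hypothetical and are not necessarily how the dataset in this workflow was stored.

```r
# Hypothetical helper: return the cached object if the file exists,
# otherwise compute it with `fetch_fun`, save it, and return it.
load_or_fetch <- function(path, fetch_fun) {
    if (file.exists(path)) {
        return(readRDS(path))
    }
    obj <- fetch_fun()
    dir.create(dirname(path), showWarnings = FALSE, recursive = TRUE)
    saveRDS(obj, path)
    obj
}

# Stand-in for a slow download step; counts how often it is actually called
fetch_count <- 0
slow_fetch <- function() {
    fetch_count <<- fetch_count + 1
    list(source = "stand-in for a slow download")
}

# Hypothetical cache location (a temporary file for this illustration)
cache_path <- tempfile(fileext = ".rds")

first  <- load_or_fetch(cache_path, slow_fetch)  # fetches and writes the cache
second <- load_or_fetch(cache_path, slow_fetch)  # reads the cache instead
```

In a real workflow, `fetch_fun` would wrap the download calls and `cache_path` would point to a file kept alongside the analysis, so that only the first run pays the download cost.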
sessionInfo()
## R version 4.3.0 (2023-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.6 LTS
##
## Matrix products: default
## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so; LAPACK version 3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=fi_FI.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=fi_FI.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=fi_FI.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=fi_FI.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Helsinki
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] miaViz_1.9.0 ggraph_2.1.0 mia_1.9.9
## [4] MultiAssayExperiment_1.27.0 TreeSummarizedExperiment_2.9.0 Biostrings_2.69.2
## [7] XVector_0.41.1 SingleCellExperiment_1.23.0 SummarizedExperiment_1.31.1
## [10] Biobase_2.61.0 GenomicRanges_1.53.1 GenomeInfoDb_1.37.2
## [13] IRanges_2.35.2 S4Vectors_0.39.1 BiocGenerics_0.47.0
## [16] MatrixGenerics_1.13.1 matrixStats_1.0.0 ggplot2_3.4.2
## [19] BiocManager_1.30.21.1
##
## loaded via a namespace (and not attached):
## [1] rstudioapi_0.15.0 jsonlite_1.8.7 magrittr_2.0.3 ggbeeswarm_0.7.2
## [5] farver_2.1.1 rmarkdown_2.23 zlibbioc_1.47.0 vctrs_0.6.3
## [9] memoise_2.0.1 DelayedMatrixStats_1.23.0 RCurl_1.98-1.12 ggtree_3.9.0
## [13] rstatix_0.7.2 htmltools_0.5.5 S4Arrays_1.1.5 BiocNeighbors_1.19.0
## [17] broom_1.0.5 gridGraphics_0.5-1 SparseArray_1.1.11 sass_0.4.7
## [21] bslib_0.5.0 htmlwidgets_1.6.2 plyr_1.8.8 DECIPHER_2.29.0
## [25] cachem_1.0.8 igraph_1.5.0.1 lifecycle_1.0.3 pkgconfig_2.0.3
## [29] rsvd_1.0.5 Matrix_1.6-0 R6_2.5.1 fastmap_1.1.1
## [33] GenomeInfoDbData_1.2.10 aplot_0.1.10 digest_0.6.33 ggnewscale_0.4.9
## [37] colorspace_2.1-0 patchwork_1.1.2 scater_1.29.3 irlba_2.3.5.1
## [41] RSQLite_2.3.1 ggpubr_0.6.0 vegan_2.6-4 beachmat_2.17.15
## [45] labeling_0.4.2 fansi_1.0.4 polyclip_1.10-4 abind_1.4-5
## [49] mgcv_1.9-0 compiler_4.3.0 bit64_4.0.5 withr_2.5.0
## [53] backports_1.4.1 BiocParallel_1.35.3 carData_3.0-5 viridis_0.6.4
## [57] DBI_1.1.3 highr_0.10 ggforce_0.4.1 ggsignif_0.6.4
## [61] MASS_7.3-60 DelayedArray_0.27.10 bluster_1.11.4 permute_0.9-7
## [65] tools_4.3.0 vipor_0.4.5 beeswarm_0.4.0 ape_5.7-1
## [69] glue_1.6.2 nlme_3.1-162 grid_4.3.0 cluster_2.1.4
## [73] reshape2_1.4.4 generics_0.1.3 gtable_0.3.3 tidyr_1.3.0
## [77] BiocSingular_1.17.1 tidygraph_1.2.3 ScaledMatrix_1.9.1 car_3.1-2
## [81] utf8_1.2.3 ggrepel_0.9.3 pillar_1.9.0 stringr_1.5.0
## [85] yulab.utils_0.0.6 splines_4.3.0 tweenr_2.0.2 dplyr_1.1.2
## [89] treeio_1.25.2 lattice_0.21-8 bit_4.0.5 tidyselect_1.2.0
## [93] DirichletMultinomial_1.43.0 scuttle_1.11.2 knitr_1.43 gridExtra_2.3
## [97] xfun_0.39 graphlayouts_1.0.0 DT_0.28 stringi_1.7.12
## [101] ggfun_0.1.1 lazyeval_0.2.2 yaml_2.3.7 evaluate_0.21
## [105] codetools_0.2-19 tibble_3.2.1 ggplotify_0.1.1 cli_3.6.1
## [109] munsell_0.5.0 jquerylib_0.1.4 Rcpp_1.0.11 parallel_4.3.0
## [113] blob_1.2.4 sparseMatrixStats_1.13.0 bitops_1.0-7 decontam_1.21.0
## [117] viridisLite_0.4.2 tidytree_0.4.4 scales_1.2.1 purrr_1.0.1
## [121] crayon_1.5.2 rlang_1.1.1